Code
library(tidymodels)
library(tidyverse)
library(janitor)
library(naniar)
library(assertr)
library(corrplot)
library(gridExtra)
## Turn off scientific notation for readable numbers
options(scipen = 999)library(tidymodels)
library(tidyverse)
library(janitor)
library(naniar)
library(assertr)
library(corrplot)
library(gridExtra)
## Turn off scientific notation for readable numbers
options(scipen = 999)Filler out Missing Values in our key predicted outcome as this will cause our predicted models to fail, it provides no addtiaional data to our model. However for missingness in our predictors, imputation will help ensure non-systemic missingness adversely effects the sample size of our model.
## Read in the CSV file
remittances <- read_csv("../data/remittances.csv") |>
filter(!is.na(remittances_gdp))## Set seed for reproducibility
set.seed(20251211)
## Split data: 80% training, 20% testing
remit_split <- initial_split(data = remittances, prop = 0.8)
remit_train <- training(x = remit_split)
remit_test <- testing(x = remit_split)## View the first few rows and column types
glimpse(remit_train)Rows: 3,292
Columns: 19
$ `Country Name` <chr> "Germany", "Brazil", "Namibia", "Mexico", "Tanzania…
$ year <dbl> 2004, 2012, 2009, 2000, 2023, 2003, 2019, 2022, 201…
$ remittances <dbl> 6515541641, 2784072055, 76511275, 7524742980, 75881…
$ remittances_gdp <dbl> 0.22842973, 0.11293366, 0.85594119, 1.01403249, 0.9…
$ `Country Code` <chr> "DEU", "BRA", "NAM", "MEX", "TZA", "NAM", "CHE", "L…
$ gdp <dbl> 2852317768062, 2465227802807, 8938847189, 742061329…
$ stock <dbl> 1093030.29, 323363.20, NA, 8072288.08, NA, NA, 2336…
$ unemployment <dbl> 10.727, 7.251, 22.254, 2.646, 2.582, 22.052, 4.394,…
$ gdp_per <dbl> 34566.7359, 12521.7213, 4302.9137, 7524.0271, 1224.…
$ inflation <dbl> 1.1550593, 7.9431269, 6.9454043, 10.6979646, 2.7482…
$ vulnerable_emp <dbl> 7.044898, 25.346971, 28.026294, 31.785613, 83.87807…
$ maternal_mortality <dbl> 6, 60, 547, 56, 276, 290, 6, 477, 68, 101, 41, 842,…
$ exchange_rate <dbl> 0.8039216, 1.9530686, 8.5228198, 9.4555583, 2383.04…
$ deportations <dbl> 91, 639, 1, 44564, 19, 1, 4, 1, 4, NA, 53, 63, 533,…
$ internet <dbl> 64.73000, 48.56000, 6.50000, 5.08138, 29.06380, 3.3…
$ poverty <dbl> 0.0, 6.4, 38.6, 16.3, NA, 47.6, 0.0, NA, NA, 38.2, …
$ dist_pop <dbl> 6035.334, 7694.307, 11720.190, 3369.053, NA, 11720.…
$ dist_cap <dbl> 6717.542, 6794.436, 11908.000, 3037.916, NA, 11908.…
$ terror <dbl> 2, 4, 2, 3, NA, 3, 1, 3, 2, 2, 2, 3, 2, 2, 2, 2, 4,…
The training data has 3,918 rows and 19 columns. We can see:
## Get min, max, mean, median for all variables
summary(remit_train) Country Name year remittances remittances_gdp
Length:3292 Min. :1994 Min. : 6038 Min. : 0.000029
Class :character 1st Qu.:2002 1st Qu.: 74764744 1st Qu.: 0.366894
Mode :character Median :2010 Median : 545314436 Median : 1.608458
Mean :2010 Mean : 2655647852 Mean : 4.070970
3rd Qu.:2017 3rd Qu.: 2042426540 3rd Qu.: 4.620723
Max. :2024 Max. :137674533896 Max. :108.402724
Country Code gdp stock
Length:3292 Min. : 37184925 Min. : 259.6
Class :character 1st Qu.: 6495301634 1st Qu.: 39746.2
Mode :character Median : 25703593810 Median : 97162.5
Mean : 303934455901 Mean : 388798.9
3rd Qu.: 168683410868 3rd Qu.: 260319.4
Max. :18316765021700 Max. :23126089.8
NA's :1460
unemployment gdp_per inflation vulnerable_emp
Min. : 0.100 Min. : 109.6 Min. : -32.741 Min. : 0.1257
1st Qu.: 3.829 1st Qu.: 1356.2 1st Qu.: 1.834 1st Qu.:13.8649
Median : 6.232 Median : 4324.5 Median : 4.197 Median :33.2140
Mean : 7.820 Mean : 12367.7 Mean : 12.021 Mean :38.8326
3rd Qu.:10.491 3rd Qu.: 14711.9 3rd Qu.: 8.589 3rd Qu.:61.1256
Max. :34.007 Max. :138935.0 Max. :4800.532 Max. :94.7169
NA's :167 NA's :32 NA's :267
maternal_mortality exchange_rate deportations internet
Min. : 1.0 Min. : 0.0004 Min. : 0.0 Min. : 0.00
1st Qu.: 14.0 1st Qu.: 1.7617 1st Qu.: 7.0 1st Qu.: 3.53
Median : 61.0 Median : 8.2770 Median : 34.0 Median : 23.00
Mean : 172.7 Mean : 390.1688 Mean : 989.7 Mean : 33.84
3rd Qu.: 225.5 3rd Qu.: 116.3786 3rd Qu.: 134.0 3rd Qu.: 62.60
Max. :5721.0 Max. :15236.8847 Max. :90504.0 Max. :100.00
NA's :153 NA's :34 NA's :360 NA's :223
poverty dist_pop dist_cap terror
Min. : 0.00 Min. : 548.4 Min. : 737 Min. :1.000
1st Qu.: 0.25 1st Qu.: 6035.3 1st Qu.: 6274 1st Qu.:1.000
Median : 1.50 Median : 7873.0 Median : 8081 Median :2.000
Mean :10.33 Mean : 8471.3 Mean : 8618 Mean :2.345
3rd Qu.:11.55 3rd Qu.:11381.9 3rd Qu.:11674 3rd Qu.:3.000
Max. :89.40 Max. :16180.3 Max. :16371 Max. :5.000
NA's :1949 NA's :145 NA's :145 NA's :148
Key observations:
## Detailed structure of the data
str(remit_train)spc_tbl_ [3,292 × 19] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ Country Name : chr [1:3292] "Germany" "Brazil" "Namibia" "Mexico" ...
$ year : num [1:3292] 2004 2012 2009 2000 2023 ...
$ remittances : num [1:3292] 6515541641 2784072055 76511275 7524742980 758814528 ...
$ remittances_gdp : num [1:3292] 0.228 0.113 0.856 1.014 0.96 ...
$ Country Code : chr [1:3292] "DEU" "BRA" "NAM" "MEX" ...
$ gdp : num [1:3292] 2852317768062 2465227802807 8938847189 742061329749 79062403837 ...
$ stock : num [1:3292] 1093030 323363 NA 8072288 NA ...
$ unemployment : num [1:3292] 10.73 7.25 22.25 2.65 2.58 ...
$ gdp_per : num [1:3292] 34567 12522 4303 7524 1224 ...
$ inflation : num [1:3292] 1.16 7.94 6.95 10.7 2.75 ...
$ vulnerable_emp : num [1:3292] 7.04 25.35 28.03 31.79 83.88 ...
$ maternal_mortality: num [1:3292] 6 60 547 56 276 290 6 477 68 101 ...
$ exchange_rate : num [1:3292] 0.804 1.953 8.523 9.456 2383.043 ...
$ deportations : num [1:3292] 91 639 1 44564 19 ...
$ internet : num [1:3292] 64.73 48.56 6.5 5.08 29.06 ...
$ poverty : num [1:3292] 0 6.4 38.6 16.3 NA 47.6 0 NA NA 38.2 ...
$ dist_pop : num [1:3292] 6035 7694 11720 3369 NA ...
$ dist_cap : num [1:3292] 6718 6794 11908 3038 NA ...
$ terror : num [1:3292] 2 4 2 3 NA 3 1 3 2 2 ...
- attr(*, "spec")=
.. cols(
.. `Country Name` = col_character(),
.. year = col_double(),
.. remittances = col_double(),
.. remittances_gdp = col_double(),
.. `Country Code` = col_character(),
.. gdp = col_double(),
.. stock = col_double(),
.. unemployment = col_double(),
.. gdp_per = col_double(),
.. inflation = col_double(),
.. vulnerable_emp = col_double(),
.. maternal_mortality = col_double(),
.. exchange_rate = col_double(),
.. deportations = col_double(),
.. internet = col_double(),
.. poverty = col_double(),
.. dist_pop = col_double(),
.. dist_cap = col_double(),
.. terror = col_double()
.. )
- attr(*, "problems")=<externalptr>
All numeric variables are stored as num or dbl (double precision), and text variables are stored as chr (character).
## Check current column names
names(remit_train) [1] "Country Name" "year" "remittances"
[4] "remittances_gdp" "Country Code" "gdp"
[7] "stock" "unemployment" "gdp_per"
[10] "inflation" "vulnerable_emp" "maternal_mortality"
[13] "exchange_rate" "deportations" "internet"
[16] "poverty" "dist_pop" "dist_cap"
[19] "terror"
Some column names have spaces and capital letters.
## Convert to lowercase with underscores
remit_train <- remit_train |>
clean_names()
## Verify the cleaned names
names(remit_train) [1] "country_name" "year" "remittances"
[4] "remittances_gdp" "country_code" "gdp"
[7] "stock" "unemployment" "gdp_per"
[10] "inflation" "vulnerable_emp" "maternal_mortality"
[13] "exchange_rate" "deportations" "internet"
[16] "poverty" "dist_pop" "dist_cap"
[19] "terror"
Now all column names are lowercase with underscores.
## Count NA values in each column
colSums(is.na(remit_train)) country_name year remittances remittances_gdp
0 0 0 0
country_code gdp stock unemployment
0 0 1460 167
gdp_per inflation vulnerable_emp maternal_mortality
0 32 267 153
exchange_rate deportations internet poverty
34 360 223 1949
dist_pop dist_cap terror
145 145 148
Several variables have missing data. Let’s calculate the percentage!
## Calculate percent missing for each variable
remit_train |>
summarise(across(everything(), ~sum(is.na(.)) / n() * 100))# A tibble: 1 × 19
country_name year remittances remittances_gdp country_code gdp stock
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0 0 0 0 0 0 44.3
# ℹ 12 more variables: unemployment <dbl>, gdp_per <dbl>, inflation <dbl>,
# vulnerable_emp <dbl>, maternal_mortality <dbl>, exchange_rate <dbl>,
# deportations <dbl>, internet <dbl>, poverty <dbl>, dist_pop <dbl>,
# dist_cap <dbl>, terror <dbl>
Major findings:
## Get min, max, mean, median, quartiles
summary(remit_train$remittances) Min. 1st Qu. Median Mean 3rd Qu. Max.
6038 74764744 545314436 2655647852 2042426540 137674533896
The mean ($2.55 billion) is much larger than the median ($529 million). The data is right-skewed.
## Create a point plot (from class notes Section 5.6.1)
remit_train |>
select(remittances, remittances_gdp, gdp, stock, unemployment,deportations) |>
pivot_longer(everything()) |>
ggplot(aes(value)) +
geom_histogram(bins = 30) +
facet_wrap(~name, scales = "free") +
theme_minimal()Warning: Removed 1987 rows containing non-finite outside the scale range
(`stat_bin()`).
#CAVEAT: I could not 'clean' the x axis (log would be the way to solve it but
#we want to show how skewed the distribution is)## Create a point plot (from class notes Section 5.6.1)
remit_train |>
ggplot(aes(remittances, 1)) +
geom_point(alpha = 0.2) +
scale_y_continuous(breaks = 0) +
labs(y = NULL, title = "Distribution of Remittances") +
theme_bw() +
theme(panel.border = element_blank())Most points cluster on the left (lower values) with a few extreme points on the right (confirms right-skewness).
## Create histogram with 30 bins
remit_train |>
ggplot(aes(x = remittances)) +
geom_histogram(bins = 30, fill = "steelblue") +
theme_minimal() +
labs(title = "Distribution of Remittances",
x = "Remittances (USD)",
y = "Count")The histogram is heavily concentrated on the left with a long tail to the right. This is classic right-skewed data.
## Create boxplot to see outliers
remit_train |>
ggplot(aes(y = remittances)) +
geom_boxplot(fill = "steelblue") +
theme_minimal() +
labs(title = "Boxplot of Remittances",
y = "Remittances (USD)")Many points appear above the upper whisker (outliers). These are likely large countries like Mexico that receive billions in remittances.
## Get summary statistics
summary(remit_train$gdp) Min. 1st Qu. Median Mean 3rd Qu.
37184925 6495301634 25703593810 303934455901 168683410868
Max.
18316765021690
GDP also shows huge range - from $21 million to $18.7 trillion.
## Create point plot for GDP
remit_train |>
ggplot(aes(gdp, 1)) +
geom_point(alpha = 0.2) +
scale_y_continuous(breaks = 0) +
labs(y = NULL, title = "Distribution of GDP") +
theme_bw() +
theme(panel.border = element_blank())GDP shows the same right-skewed pattern as remittances. Large economies have much higher GDP than small economies.
## Get summary statistics
summary(remit_train$unemployment) Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.100 3.829 6.232 7.820 10.491 34.007 167
Unemployment ranges from 0.11% to 34.01%.
## Create point plot for unemployment
remit_train |>
ggplot(aes(unemployment, 1)) +
geom_point(alpha = 0.2) +
scale_y_continuous(breaks = 0) +
labs(y = NULL, title = "Distribution of Unemployment") +
theme_bw() +
theme(panel.border = element_blank())Warning: Removed 167 rows containing missing values or values outside the scale range
(`geom_point()`).
Unemployment appears more evenly distributed than remittances or GDP.
## Get summary statistics
summary(remit_train$inflation) Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
-32.741 1.834 4.197 12.021 8.589 4800.532 32
The maximum inflation is 6,041%! (outlier) Most inflation values are between 1.7% and 9%,
## Create point plot for inflation
remit_train |>
ggplot(aes(inflation, 1)) +
geom_point(alpha = 0.2) +
scale_y_continuous(breaks = 0) +
labs(y = NULL, title = "Distribution of Inflation") +
theme_bw() +
theme(panel.border = element_blank())Warning: Removed 32 rows containing missing values or values outside the scale range
(`geom_point()`).
## Test that remittances > 0 when not missing
remit_train |>
filter(!is.na(remittances)) |>
verify(remittances > 0) |>
summarise(mean_remittances = mean(remittances, na.rm = TRUE))# A tibble: 1 × 1
mean_remittances
<dbl>
1 2655647852.
All remittances are positive. The mean is $2.55 billion.
## Test that years are between 1994 and 2024
remit_train |>
verify(year >= 1994 & year <= 2024) |>
summarise(mean_year = mean(year))# A tibble: 1 × 1
mean_year
<dbl>
1 2010.
All years are within the expected range.
## Test that unemployment is between 0 and 100
remit_train |>
filter(!is.na(unemployment)) |>
verify(unemployment >= 0 & unemployment <= 100) |>
summarise(mean_unemployment = mean(unemployment, na.rm = TRUE))# A tibble: 1 × 1
mean_unemployment
<dbl>
1 7.82
All unemployment values are valid percentages (0-100%).
Since remittances and GDP are highly right-skewed, we need to create and examine log.
## Create log-transformed versions of skewed variables
remit_train <- remit_train |>
mutate(
log_remittances = log(remittances + 1),
log_gdp = log(gdp + 1)
)We add 1 before taking the log to handle any zero values (log(0) is undefined).
## Get summary statistics for log remittances
summary(remit_train$log_remittances) Min. 1st Qu. Median Mean 3rd Qu. Max.
8.706 18.130 20.117 19.744 21.437 25.648
## Create histogram for log-transformed remittances
remit_train |>
filter(!is.na(log_remittances)) |>
ggplot(aes(x = log_remittances)) +
geom_histogram(bins = 30, fill = "darkgreen", color = "white") +
theme_minimal() +
labs(title = "Distribution of Log-Transformed Remittances",
subtitle = "Much more normal distribution after log transformation",
x = "Log(Remittances + 1)",
y = "Count")The log-transformed remittances show a much more normal distribution compared to the original right-skewed data.
## Create boxplot for log remittances
remit_train |>
filter(!is.na(log_remittances)) |>
ggplot(aes(y = log_remittances)) +
geom_boxplot(fill = "darkgreen") +
theme_minimal() +
labs(title = "Boxplot of Log-Transformed Remittances",
y = "Log(Remittances + 1)")Fewer outliers visible after log transformation.
## Get summary statistics for log GDP
summary(remit_train$log_gdp) Min. 1st Qu. Median Mean 3rd Qu. Max.
17.43 22.59 23.97 24.13 25.85 30.54
## Create histogram for log-transformed GDP
remit_train |>
filter(!is.na(log_gdp)) |>
ggplot(aes(x = log_gdp)) +
geom_histogram(bins = 30, fill = "darkblue", color = "white") +
theme_minimal() +
labs(title = "Distribution of Log-Transformed GDP",
subtitle = "More normal distribution after log transformation",
x = "Log(GDP + 1)",
y = "Count")Log GDP also shows a more normal distribution.
## Create comparison plots
p1 <- remit_train |>
filter(!is.na(remittances)) |>
ggplot(aes(x = remittances)) +
geom_histogram(bins = 30, fill = "steelblue") +
theme_minimal() +
labs(title = "Original Remittances (Right-Skewed)",
x = "Remittances (USD)")
p2 <- remit_train |>
filter(!is.na(log_remittances)) |>
ggplot(aes(x = log_remittances)) +
geom_histogram(bins = 30, fill = "darkgreen") +
theme_minimal() +
labs(title = "Log-Transformed Remittances (More Normal)",
x = "Log(Remittances + 1)")
grid.arrange(p1, p2, ncol = 2)The log transformation successfully converts the right-skewed distribution into a more normal distribution, which is better for modeling.
## Create comparison plots for GDP
p3 <- remit_train |>
filter(!is.na(gdp)) |>
ggplot(aes(x = gdp)) +
geom_histogram(bins = 30, fill = "steelblue") +
theme_minimal() +
labs(title = "Original GDP (Right-Skewed)",
x = "GDP (USD)")
p4 <- remit_train |>
filter(!is.na(log_gdp)) |>
ggplot(aes(x = log_gdp)) +
geom_histogram(bins = 30, fill = "darkblue") +
theme_minimal() +
labs(title = "Log-Transformed GDP (More Normal)",
x = "Log(GDP + 1)")
grid.arrange(p3, p4, ncol = 2)## Scatter plot with log-transformed variables
remit_train |>
filter(!is.na(log_gdp), !is.na(log_remittances)) |>
ggplot(aes(x = log_gdp, y = log_remittances)) +
geom_point(alpha = 0.3, color = "steelblue") +
geom_smooth(method = "lm", se = TRUE, color = "red") +
theme_minimal() +
labs(title = "Log GDP vs Log Remittances",
subtitle = "Clearer linear relationship after log transformation",
x = "Log(GDP + 1)",
y = "Log(Remittances + 1)")`geom_smooth()` using formula = 'y ~ x'
The relationship between log GDP and log remittances is more linear than the original variables, which will improve model performance.
ggplot(remit_train, aes(x = deportations + 1, y = remittances + 1)) +
geom_point(alpha = 0.3, color = "steelblue") +
geom_smooth(method = "lm", se = TRUE, color = "red") +
scale_x_log10() +
scale_y_log10() +
labs(
title = "Log–Log Relationship Between Deportations and Remittances",
x = "Log(Deportations + 1)",
y = "Log(Remittances + 1)"
) +
theme_minimal()`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 360 rows containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 360 rows containing missing values or values outside the scale range
(`geom_point()`).
## Count distinct country names
n_distinct(remit_train$country_name)[1] 151
We have 158 different countries in the dataset.
## Show how many observations per country
head(table(remit_train$country_name), 20)
Afghanistan Albania Algeria Angola
16 25 27 12
Antigua and Barbuda Argentina Armenia Australia
25 22 23 21
Austria Azerbaijan Bangladesh Barbados
22 21 22 25
Belarus Belgium Belize Benin
26 22 20 26
Bermuda Bhutan Bolivia Botswana
16 12 29 26
Most countries have between 20-30 observations, representing roughly 20-30 years of data.
## Count observations per country and plot top 20
remit_train |>
count(country_name, sort = TRUE) |>
slice_head(n = 20) |>
ggplot(aes(x = reorder(country_name, n), y = n)) +
geom_bar(stat = "identity", fill = "steelblue") +
coord_flip() +
theme_minimal() +
labs(title = "Top 20 Countries by Number of Observations",
x = "Country",
y = "Count")Countries are fairly evenly represented. Bahrain has the most observations (30), while several countries have around 20-29 observations.
remit_train %>%
filter(year == 2024) %>%
slice_max(remittances, n = 15) %>%
ggplot(aes(
x = fct_reorder(country_name, remittances),
y = remittances / 1e9
)) +
geom_col(fill = "steelblue") +
coord_flip() +
labs(
title = "Top 15 Remittance Receivers (2024)",
x = "Country",
y = "Remittances (Billions USD)"
) +
theme_minimal()# Top 15 by remittances/GDP (2024)
remit_train %>%
filter(year == 2024) %>%
slice_max(remittances_gdp, n = 15) %>% # preferred over top_n()
mutate(`Country Name` = fct_reorder(country_name, remittances_gdp)) %>%
ggplot(aes(x = `Country Name`, y = remittances_gdp)) +
geom_col(fill = "coral") +
coord_flip() +
labs(
title = "Top 15 Countries: Remittances as % GDP (2024)",
x = "Country",
y = "Remittances (% GDP)"
) +
theme_minimal()remit_train |>
group_by(country_name) |>
summarize(mean_ratio = mean(remittances_gdp, na.rm = TRUE)) |>
arrange(desc(mean_ratio)) |>
slice_head(n = 10)# A tibble: 10 × 2
country_name mean_ratio
<chr> <dbl>
1 Lesotho 42.9
2 Tonga 31.8
3 Bermuda 21.1
4 Nepal 20.2
5 Samoa 19.6
6 Lebanon 19.5
7 El Salvador 17.9
8 Kosovo 17.0
9 Jordan 16.1
10 Honduras 15.1
top10 <- remit_train |>
group_by(country_name) |>
summarize(mean_ratio = mean(remittances_gdp, na.rm = TRUE)) |>
arrange(desc(mean_ratio)) |>
slice_head(n = 10)
ggplot(top10, aes(x = reorder(country_name, mean_ratio), y = mean_ratio)) +
geom_col(fill = "coral") +
coord_flip() +
labs(
title = "Top 10 Countries by Average Remittances % of GDP",
x = "Country",
y = "Average Remittances/GDP"
) +
theme_minimal()remit_train |>
filter(country_name %in% c(
"Nicaragua", "El Salvador", "Honduras", "Guatemala", "Haiti", "India"
)) |>
ggplot(aes(x = year, y = remittances_gdp, color = country_name)) +
geom_line(linewidth = 1) +
scale_y_log10() +
labs(
title = "Remittance Trends Over Time",
subtitle = "Selected Countries (log scale)",
x = "Year",
y = "Remittances (% of GDP, log scale)",
color = "Country"
) +
theme_minimal()#Start assessing countries of interes
countries_of_interest <- c( "Nicaragua", "El Salvador", "Honduras", "Guatemala",
"Haiti", "India")
filtered <- remit_train |>
filter(country_name %in% countries_of_interest)
ggplot(filtered, aes(log(stock), log(remittances), color = country_name)) +
geom_point(alpha = 0.6) +
geom_smooth(method = "lm", se = FALSE) +
labs(
x = "log(stock)",
y = "log(remittances)",
title = "Stock–Remittance Relationship for Selected Countries",
color = "Country"
) +
theme_minimal()`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 13 rows containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 13 rows containing missing values or values outside the scale range
(`geom_point()`).
#Comment: comparing the stock–Remittance Relationship for Selected Countries## Scatter plot with trend line
remit_train |>
ggplot(aes(x = gdp, y = remittances)) +
geom_point(alpha = 0.3, color = "steelblue") +
geom_smooth(method = "lm", se = TRUE, color = "red") +
theme_minimal() +
labs(title = "Relationship Between GDP and Remittances (Original Scale)",
x = "GDP (USD)",
y = "Remittances (USD)")`geom_smooth()` using formula = 'y ~ x'
There is a clear positive relationship. Countries with larger economies (higher GDP) tend to receive more remittances in absolute dollar amounts. The red line shows the linear trend.
## Scatter plot with log-transformed variables
remit_train |>
filter(!is.na(log_gdp), !is.na(log_remittances)) |>
ggplot(aes(x = log_gdp, y = log_remittances)) +
geom_point(alpha = 0.3, color = "darkgreen") +
geom_smooth(method = "lm", se = TRUE, color = "red") +
theme_minimal() +
labs(title = "Log GDP vs Log Remittances (Log Scale - Better Linear Fit)",
subtitle = "This relationship is more appropriate for linear regression models",
x = "Log(GDP + 1)",
y = "Log(Remittances + 1)")`geom_smooth()` using formula = 'y ~ x'
The log-transformed relationship is more linear and will produce better model predictions.
## Scatter plot with trend line
remit_train |>
ggplot(aes(x = gdp_per, y = remittances_gdp)) +
geom_point(alpha = 0.3, color = "steelblue") +
geom_smooth(method = "lm", se = TRUE, color = "red") +
theme_minimal() +
labs(title = "GDP Per Capita vs Remittances as % of GDP",
x = "GDP Per Capita (USD)",
y = "Remittances as % of GDP")`geom_smooth()` using formula = 'y ~ x'
Poorer countries (lower GDP per capita) depend more heavily on remittances as a percentage of their economy. Richer countries receive remittances but they represent a smaller share of their total GDP.
## Scatter plot with trend line
remit_train |>
ggplot(aes(x = unemployment, y = remittances)) +
geom_point(alpha = 0.3, color = "steelblue") +
geom_smooth(method = "lm", se = TRUE, color = "red") +
theme_minimal() +
labs(title = "Unemployment vs Remittances",
x = "Unemployment Rate (%)",
y = "Remittances (USD)")`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 167 rows containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 167 rows containing missing values or values outside the scale range
(`geom_point()`).
There is a slight negative relationship (it is weak). Unemployment doesn’t appear to be a strong predictor of remittances.
## Calculate correlations between all numeric variables
remit_train |>
select(where(is.numeric)) |>
cor(use = "complete.obs") |>
round(2) year remittances remittances_gdp gdp stock unemployment
year 1.00 0.18 0.07 0.13 -0.01 -0.07
remittances 0.18 1.00 0.02 0.47 0.47 -0.12
remittances_gdp 0.07 0.02 1.00 -0.21 0.03 0.03
gdp 0.13 0.47 -0.21 1.00 0.18 -0.09
stock -0.01 0.47 0.03 0.18 1.00 -0.13
unemployment -0.07 -0.12 0.03 -0.09 -0.13 1.00
gdp_per 0.22 0.02 -0.38 0.19 -0.09 -0.07
inflation -0.13 -0.06 0.03 -0.12 0.00 -0.06
vulnerable_emp -0.08 0.05 0.39 -0.10 0.06 -0.12
maternal_mortality -0.10 0.04 0.11 -0.11 -0.01 -0.19
exchange_rate 0.03 0.05 -0.05 -0.03 -0.05 -0.09
deportations -0.09 0.28 0.23 0.01 0.76 -0.13
internet 0.68 0.06 -0.27 0.20 -0.08 0.02
poverty -0.28 0.01 0.18 -0.11 0.04 -0.12
dist_pop 0.06 0.08 -0.09 0.13 -0.17 -0.06
dist_cap 0.06 0.09 -0.11 0.13 -0.19 -0.04
terror -0.09 0.21 0.24 0.04 0.21 -0.12
log_remittances 0.26 0.69 0.13 0.36 0.30 -0.13
log_gdp 0.14 0.45 -0.59 0.60 0.19 -0.06
gdp_per inflation vulnerable_emp maternal_mortality
year 0.22 -0.13 -0.08 -0.10
remittances 0.02 -0.06 0.05 0.04
remittances_gdp -0.38 0.03 0.39 0.11
gdp 0.19 -0.12 -0.10 -0.11
stock -0.09 0.00 0.06 -0.01
unemployment -0.07 -0.06 -0.12 -0.19
gdp_per 1.00 -0.30 -0.63 -0.35
inflation -0.30 1.00 0.15 0.17
vulnerable_emp -0.63 0.15 1.00 0.65
maternal_mortality -0.35 0.17 0.65 1.00
exchange_rate -0.18 0.07 0.27 0.24
deportations -0.14 0.02 0.09 0.00
internet 0.72 -0.28 -0.60 -0.43
poverty -0.44 0.25 0.71 0.74
dist_pop -0.19 0.11 0.28 0.22
dist_cap -0.15 0.10 0.25 0.21
terror -0.58 0.27 0.57 0.39
log_remittances 0.11 -0.14 -0.01 -0.06
log_gdp 0.51 -0.16 -0.40 -0.23
exchange_rate deportations internet poverty dist_pop
year 0.03 -0.09 0.68 -0.28 0.06
remittances 0.05 0.28 0.06 0.01 0.08
remittances_gdp -0.05 0.23 -0.27 0.18 -0.09
gdp -0.03 0.01 0.20 -0.11 0.13
stock -0.05 0.76 -0.08 0.04 -0.17
unemployment -0.09 -0.13 0.02 -0.12 -0.06
gdp_per -0.18 -0.14 0.72 -0.44 -0.19
inflation 0.07 0.02 -0.28 0.25 0.11
vulnerable_emp 0.27 0.09 -0.60 0.71 0.28
maternal_mortality 0.24 0.00 -0.43 0.74 0.22
exchange_rate 1.00 -0.04 -0.16 0.29 0.39
deportations -0.04 1.00 -0.19 0.13 -0.23
internet -0.16 -0.19 1.00 -0.60 -0.14
poverty 0.29 0.13 -0.60 1.00 0.23
dist_pop 0.39 -0.23 -0.14 0.23 1.00
dist_cap 0.37 -0.26 -0.11 0.21 1.00
terror 0.20 0.21 -0.52 0.47 0.27
log_remittances 0.08 0.21 0.19 -0.06 0.03
log_gdp 0.04 0.00 0.43 -0.27 0.10
dist_cap terror log_remittances log_gdp
year 0.06 -0.09 0.26 0.14
remittances 0.09 0.21 0.69 0.45
remittances_gdp -0.11 0.24 0.13 -0.59
gdp 0.13 0.04 0.36 0.60
stock -0.19 0.21 0.30 0.19
unemployment -0.04 -0.12 -0.13 -0.06
gdp_per -0.15 -0.58 0.11 0.51
inflation 0.10 0.27 -0.14 -0.16
vulnerable_emp 0.25 0.57 -0.01 -0.40
maternal_mortality 0.21 0.39 -0.06 -0.23
exchange_rate 0.37 0.20 0.08 0.04
deportations -0.26 0.21 0.21 0.00
internet -0.11 -0.52 0.19 0.43
poverty 0.21 0.47 -0.06 -0.27
dist_pop 1.00 0.27 0.03 0.10
dist_cap 1.00 0.24 0.04 0.12
terror 0.24 1.00 0.20 -0.08
log_remittances 0.04 0.20 1.00 0.53
log_gdp 0.12 -0.08 0.53 1.00
## Create visual correlation matrix (Eva's example)
remit_train |>
select(where(is.numeric)) |>
cor(use = "complete.obs") |>
corrplot(method = "color",
type = "upper",
tl.col = "black",
tl.srt = 45,
title = "Correlation Matrix",
mar = c(0,0,2,0))## Create visual correlation matrix (Gaby's example)
library(reshape2)
Attaching package: 'reshape2'
The following object is masked from 'package:tidyr':
smiths
#Create variable to represent numeric vars
numeric_vars_log <- remit_train %>%
select(remittances_gdp, remittances, stock, unemployment, gdp_per,
inflation, internet, dist_cap, terror) %>%
mutate(
remittances_gdp = log1p(remittances_gdp),
remittances = log1p(remittances),
stock = log1p(stock), # migrant stock
gdp_per = log1p(gdp_per),
internet = log1p(internet),
dist_cap = log1p(dist_cap)
) %>%
na.omit()
cor_matrix_log <- cor(numeric_vars_log, use = "complete.obs")
melted_cor_log <- melt(cor_matrix_log)
ggplot(melted_cor_log, aes(Var1, Var2, fill = value)) +
geom_tile() +
geom_text(aes(label = round(value, 2)), size = 3) +
scale_fill_gradient2(
low = "blue", mid = "white", high = "red",
midpoint = 0, limits = c(-1, 1)
) +
labs(title = "Correlation Heatmap (Log-Transformed Variables)") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))## Calculate average remittances per year and plot
remit_train |>
group_by(year) |>
summarise(avg_remittances = mean(remittances, na.rm = TRUE)) |>
ggplot(aes(x = year, y = avg_remittances)) +
geom_line(color = "steelblue", linewidth = 1) +
geom_point(color = "steelblue") +
theme_minimal() +
labs(title = "Average Remittances Over Time (1994-2024)",
x = "Year",
y = "Average Remittances (USD)")Remittances show a clear upward trend over 30 years. We can see:
## Calculate average remittances as % of GDP per year and plot
remit_train |>
group_by(year) |>
summarise(avg_remittances_gdp = mean(remittances_gdp, na.rm = TRUE)) |>
ggplot(aes(x = year, y = avg_remittances_gdp)) +
geom_line(color = "steelblue", linewidth = 1) +
geom_point(color = "steelblue") +
theme_minimal() +
labs(title = "Average Remittances as % of GDP Over Time",
x = "Year",
y = "Remittances as % of GDP")Remittances as a percentage of GDP have stayed relatively stable around 3-4% over time. This means remittances are growing roughly in line with GDP growth, not becoming more or less important to economies over time.
It is likely that past changes to country of origin conditions is likely to be more illustrative of future remittances than current conditions. For example, while a downward shock in GDP may influence current migration, migrants may take time to settle into the US and thus begin remitting back home.
These lagged effect would seem to be the most plausible with GDP, unemployment, terror, deportations, and changes in migrant stock (inward migration)
## Lagged (1 year)
remit_lag <- remit_train |>
mutate(year = as.numeric(year)) |>
arrange(country_name, year) |>
group_by(country_name) |>
mutate(
gdp_lag = lag(gdp_per),
unemp_lag = lag(unemployment),
terror_lag = lag(terror),
deportations_lag = lag(deportations),
stock_lag = lag(stock),
) |>
ungroup()
# Verifying that the lag worked.
remit_lag |>
select(country_name, year, gdp_per, gdp_lag) |>
arrange(country_name, year) |>
filter(!is.na(gdp_lag)) |>
slice_head(n = 10)# A tibble: 10 × 4
country_name year gdp_per gdp_lag
<chr> <dbl> <dbl> <dbl>
1 Afghanistan 2009 452. 382.
2 Afghanistan 2010 561. 452.
3 Afghanistan 2011 607. 561.
4 Afghanistan 2012 651. 607.
5 Afghanistan 2013 637. 651.
6 Afghanistan 2014 625. 637.
7 Afghanistan 2015 566. 625.
8 Afghanistan 2016 522. 566.
9 Afghanistan 2017 525. 522.
10 Afghanistan 2018 491. 525.
## Lagged predictors relationship with remittances (as % of GDP)
## GDP per capita
# Lagged vs Unlagged
remit_lag |>
pivot_longer(cols = c(gdp_per, gdp_lag),
names_to = "type",
values_to = "value") |>
ggplot(aes(value, remittances_gdp, color = type)) +
geom_point(alpha = 0.3) +
geom_smooth(se = FALSE) +
theme_minimal() +
labs(title = "Lagged vs Current GDP per Capita",
x = "GDP per capita",
color = "Variable") ## Doesn't Necessarily Improve Model Fit. Overall there seem to be better ways to verify whether lagged variable would improve the interpretability of our models.
## Comparing Lagged vs Current GDP per capita for key countries.
remit_lag |>
filter(country_name %in% c("Nicaragua", "El Salvador", "Honduras",
"Guatemala", "Haiti", "India")) |>
pivot_longer(
cols = c(gdp_per, gdp_lag),
names_to = "gdp_type",
values_to = "gdp_value"
) |>
ggplot(aes(x = gdp_value, y = remittances_gdp,
color = country_name, linetype = gdp_type)) +
geom_point(alpha = 0.4) +
geom_smooth(method = "lm", se = FALSE) +
theme_minimal() +
labs(
title = "Lagged vs Current GDP per Capita",
x = "GDP per capita (current or lagged)",
linetype = "GDP variable"
)Suggestion is that Lagged GDP demonstrates a slightly stronger relationship and thus may improve model fit. Thus it may seem that shocks or changes to prior GDP could help explain current remittances amounts.
## Comparing Lagged vs Current Unemployment for key countries.
remit_lag |>
filter(country_name %in% c("Nicaragua", "El Salvador", "Honduras",
"Guatemala", "Haiti", "India")) |>
pivot_longer(
cols = c(unemployment, unemp_lag),
names_to = "unemp_type",
values_to = "unemp_value"
) |>
ggplot(aes(x = unemp_value, y = remittances_gdp,
color = country_name, linetype = unemp_type)) +
geom_point(alpha = 0.4) +
geom_smooth(method = "lm", se = FALSE) +
theme_minimal() +
labs(
title = "Lagged vs Current Unemployment",
x = "Unemployment (current or lagged)",
linetype = "GDP variable"
)For most countries lagged unemployment does not seem to alter model fit substantially for any country other than Haiti.
It likely won’t improve our model fit and thus shouldn’t be included.
## Comparing Lagged vs Current Terror for key countries.
remit_lag |>
filter(country_name %in% c("Nicaragua", "El Salvador", "Honduras",
"Guatemala", "Haiti", "India")) |>
pivot_longer(
cols = c(terror, terror_lag),
names_to = "terror_type",
values_to = "terror_value"
) |>
ggplot(aes(x = terror_value, y = remittances_gdp,
color = country_name, linetype = terror_type)) +
geom_point(alpha = 0.4) +
geom_smooth(method = "lm", se = FALSE) +
theme_minimal() +
labs(
title = "Lagged vs Current Terror",
x = "Terror (current or lagged)",
linetype = "Terror variable"
)It seems terror varies less, and lagged terror levels may not be too explanatory
## Comparing Lagged vs Current Deportations for key countries.
remit_lag |>
filter(country_name %in% c("Nicaragua", "El Salvador", "Honduras",
"Guatemala", "Haiti", "India")) |>
pivot_longer(
cols = c(deportations, deportations_lag),
names_to = "deportations_type",
values_to = "deportations_value"
) |>
ggplot(aes(x = deportations_value, y = remittances_gdp,
color = country_name, linetype = deportations_type)) +
geom_point(alpha = 0.4) +
geom_smooth(method = "lm", se = FALSE) +
theme_minimal() +
labs(
title = "Lagged vs Current Deportations",
x = "Deportations (current or lagged)",
linetype = "Deportations variable"
)Much stronger relationship for key countries in including the lagged effects of deportations in explaining future remittances.
## Comparing Lagged vs Current Deportations for key countries.
remit_lag |>
filter(country_name %in% c("Nicaragua", "El Salvador", "Honduras",
"Guatemala", "Haiti", "India")) |>
pivot_longer(
cols = c(stock, stock_lag),
names_to = "stock_type",
values_to = "stock_value"
) |>
ggplot(aes(x = stock_value, y = remittances_gdp,
color = country_name, linetype = stock_type)) +
geom_point(alpha = 0.4) +
geom_smooth(method = "lm", se = FALSE) +
theme_minimal() +
labs(
title = "Lagged vs Current Changes in Migrant Stock ",
x = "Migrant Stock (current or lagged)",
linetype = "Migrant Stock variable"
)Less Strong change and thus probably doesn’t warrant inclusion.
Takeaways:
Predictive power of lagged deportations and GDP improve model fit the best (shift the slopes of our relationships most).
For the other predictors, including changes to migrant stock, terror, and unemployment the relationships for our key variables barely changed indicating no changes in explanatory power.
Next Steps: using step_mutate in our recipe to add lags to our gdp_per and deportations would account for this.